White Wine EDA by Amrut Deshpande

All necessary packages are loaded.

Data is read.

A. UNIVARIATE ANALYSIS

Univariate Plots Section

To begin the univariate analysis, I first looked at the dimensions, variable names and structure of the data. This gives a brief idea of what the data covers and how it should be examined. The summary of the wine dataset then provides basic statistics for each variable.

## [1] 4898   13

With the dim command, I now know that there are 4898 observations and 13 variables in the white wine data set.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

There are 13 variables and the names are listed above.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The str command shows the data type of each variable and the first few values of each.

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

The head command shows the headers of all the variables and the first few readings of each variable.
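The inspection steps above can be sketched as follows. The report's data file is not shown, so this sketch rebuilds only the six rows printed by head into a data frame called ww_head (a name chosen here for illustration). On the full data the same calls would produce the output shown above.

```r
# Rebuild the six rows printed by head(); ww_head is an illustrative name,
# not the object used in the original analysis.
ww_head <- data.frame(
  X                    = 1:6,
  fixed.acidity        = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23, 0.28),
  citric.acid          = c(0.36, 0.34, 0.40, 0.32, 0.32, 0.40),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  chlorides            = c(0.045, 0.049, 0.050, 0.058, 0.058, 0.050),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97),
  density              = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951),
  pH                   = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26),
  sulphates            = c(0.45, 0.49, 0.44, 0.40, 0.40, 0.44),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  quality              = rep(6L, 6)
)

dim(ww_head)      # number of rows and columns
names(ww_head)    # column names
str(ww_head)      # data type and first values of each column
head(ww_head, 3)  # first rows
summary(ww_head)  # min, quartiles, mean and max per column
```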

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

With summary, we can read off a few statistics for each variable, such as the mean, median and quartile values. Below are my observations.

The median quality is 6 and the mean is 5.878, so a 6 rating is actually closer to average than a 5 for this dataset. Even though quality is meant to be a rating from 0 to 10, only 3 through 9 are used.

All of the features have a minimum value greater than 0 except for citric acid. Most pH values fall between 3 and 3.3.

Residual sugar, free sulfur dioxide and total sulfur dioxide seem to have unusual distributions, since their maximum values are much higher than their means and medians. There is also a large gap between the 3rd quartile and the maximum for each of them.

Wines with residual sugar greater than 45 are considered sweet according to the variable guidelines. I believe there won't be many sweet white wines, as the 3rd quartile value is only 9.9.

## 
## FALSE  TRUE 
##  4897     1
## 
## FALSE  TRUE 
##  3669  1229
## 
## FALSE  TRUE 
##  3656  1242
## 
## FALSE  TRUE 
##  3647  1251

Looking at the number of observations above the 3rd quartile, I conclude that these counts add little to the analysis: the TRUE to FALSE ratio is close to 1:3 for all three variables, which is what the definition of a quartile implies.

Also, there is only one observation which is considered sweet white wine based on the guidelines.
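As a quick check of the 1:3 reading, the TRUE counts printed above can be divided by the number of wines. The sketch below uses only numbers from the output; the labels first, second and third are placeholders, since the output does not name the variable behind each table. The helper count_above is a hypothetical name showing how such a table is produced.

```r
# TRUE counts copied from the table() output above (4898 wines in total).
n_total  <- 4898
above_q3 <- c(first = 1229, second = 1242, third = 1251)

# Share of wines above the 3rd quartile; each should be close to 0.25.
share_above <- above_q3 / n_total
round(share_above, 3)

# Hypothetical helper: count values above a threshold with table().
count_above <- function(x, threshold) table(x > threshold)

# Residual sugar of the six rows printed by head(), against the
# "sweet" threshold of 45 mentioned in the text.
sugar_head <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9)
sweet_head <- count_above(sugar_head, 45)
sweet_head
```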

Histogram plots of variables

Looking at the histograms of all the variables, I see that quality takes only particular values, i.e. it represents discrete rating levels ordered from 3 to 9, as seen in both the summary and the histogram.

I had guessed from the summary that the most common wine quality rating would be 6, and it is, with 2198 observations. Ratings 5 and 7 are the 2nd and 3rd most common. With so few observations at 3 and 9, those ratings can be considered very poor and very good respectively.

Fixed acidity and volatile acidity have outliers at both the low and high extremes, each with very low counts.

Citric acid also has low-count outliers at both extremes, but mainly I observe a sharp peak with over 200 counts at a value of 0.5.

Residual sugar follows a skewed distribution, with a sharp peak at low residual sugar values. This needs attention.

Free sulfur dioxide, total sulfur dioxide and density follow roughly normal distributions, with the exception of outliers at the high end. Total sulfur dioxide is densest between 100 and 200, and density is concentrated in a narrow band just below 1 (the interquartile range is 0.9917 to 0.9961).

pH shows the closest-to-normal distribution so far, and the majority of wine samples have a pH of around 3.1 to 3.3. There are outlying samples with pH values greater than 3.6 and less than 2.8. The highest counts for sulphates are observed between 0.4 and 0.5.

The alcohol distribution is somewhat unusual: it is right-skewed, with counts falling off as alcohol increases, but it is roughly bell-shaped on the whole.

Finally, I believe the quality variable should be made an ordered factor. So, I went ahead and changed the data type to prep it for future analysis.

##  Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
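A minimal sketch of this conversion is given below. Since the data file is not shown, the quality column is rebuilt from the per-rating counts printed later in this report; the names ratings and quality_ord are illustrative. The sketch also shows a detail worth keeping in mind: as.integer on an ordered factor returns level codes (1 to 7), not the ratings themselves (3 to 9).

```r
# Rebuild the quality column from the per-rating counts reported later
# in this report (20 wines rated 3, 163 rated 4, and so on).
ratings <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Convert to an ordered factor with levels 3 < 4 < ... < 9.
quality_ord <- factor(ratings, levels = 3:9, ordered = TRUE)
str(quality_ord)

# Level codes versus the original ratings.
codes <- as.integer(quality_ord)                # 1..7
back  <- as.integer(as.character(quality_ord))  # 3..9
range(codes)
range(back)
mean(back)
```

If a numeric version of quality is needed later (for correlations or a linear model), converting through as.character first keeps the values on the 3 to 9 rating scale.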

Adjusting binwidth of the variables

To follow the shape of the data more accurately, I am adjusting the binwidths on these charts.

The histograms above show roughly normal distributions for most of the variables except residual sugar. Residual sugar has a long-tailed distribution, and I will use transformations to obtain a better representation of this variable.

Also, for some of the variables the bell curve is shifted towards the left, which may be due to outliers. I would like to investigate this later in the analysis.

Transformation of residual sugar variable

Except for residual sugar, which shows a positively skewed distribution, all the other variables roughly follow a normal distribution. Hence I shall transform the residual sugar variable using sqrt and log10 transformations on the x axis to reduce the skew.

The sqrt transformation has little visible impact on the residual sugar variable, but the log10 transformation removes the long tail and reveals a bimodal shape rather than a single bell curve. Therefore, we use the log10-transformed data for further analysis.
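The effect of the two transformations on the long tail can be illustrated with the five summary values of residual sugar printed earlier. This is only a rough sketch: tail_ratio is a hypothetical helper that compares the length of the upper tail (3rd quartile to maximum) with the rest of the range (minimum to 3rd quartile); a value well above 1 indicates a long right tail.

```r
# Five-number summary of residual sugar, copied from the summary() output.
sugar_five_num <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# Hypothetical helper: length of the upper tail relative to the rest.
tail_ratio <- function(q) {
  unname((q["max"] - q["q3"]) / (q["q3"] - q["min"]))
}

raw_ratio  <- tail_ratio(sugar_five_num)         # untransformed scale
sqrt_ratio <- tail_ratio(sqrt(sugar_five_num))   # after sqrt
log_ratio  <- tail_ratio(log10(sugar_five_num))  # after log10

round(c(raw = raw_ratio, sqrt = sqrt_ratio, log10 = log_ratio), 2)
```

On these numbers the upper tail shrinks under sqrt but is still about twice as long as the rest of the range, while under log10 it becomes shorter than the rest.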

Frequency polygon of residual sugar variable

The distribution can also be shown with a frequency polygon, and I have plotted one for the residual sugar variable below.

The frequency polygon also shows that residual sugar follows a bimodal distribution.

Removing outliers

For some of the variables, the bell curve is shifted towards the left, which may be due to outliers. I would like to investigate this now by removing the outliers.

The distributions look better for most of the variables after removing the outliers. The distribution of the response variable is analyzed in the following section.

As noted earlier, citric acid is the only variable with some 0 values. However, its histogram shows that the 0 values are a very small percentage of the total dataset. This is noteworthy because we can consider all variables to contribute a nonzero value to the entries in the dataset.

We see a roughly normal distribution with most wines rated 6, and there are fewer wines of excellent or poor quality.

Creating new variable

After reading the variable descriptions for the white wine data, I would like to create two new fields: the ratio of free sulfur dioxide to total sulfur dioxide, and the ratio of residual sugar to alcohol. The data already has features for free sulfur dioxide and total sulfur dioxide, so going forward I can use these ratios as additional features.

##  ft_sulfur.dioxide      rsa        
##  Min.   :0.02362   Min.   :0.0566  
##  1st Qu.:0.19093   1st Qu.:0.1575  
##  Median :0.25368   Median :0.4906  
##  Mean   :0.25558   Mean   :0.6423  
##  3rd Qu.:0.31579   3rd Qu.:0.9773  
##  Max.   :0.71053   Max.   :5.6239
## 'data.frame':    4898 obs. of  2 variables:
##  $ ft_sulfur.dioxide: num  0.265 0.106 0.309 0.253 0.253 ...
##  $ rsa              : num  2.352 0.168 0.683 0.859 0.859 ...
##  quality 
##  3:  20  
##  4: 163  
##  5:1457  
##  6:2198  
##  7: 880  
##  8: 175  
##  9:   5

The residual sugar to alcohol ratio, i.e. the rsa variable, shows an unusual maximum value that deviates largely from its mean. This may be due to sweet wine having a high residual sugar level after fermentation.

The free to total sulfur dioxide ratio looks normally distributed. The sugar to alcohol ratio shows bimodal behaviour, as residual sugar did in its individual plot.
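The two ratios can be sketched on the six rows printed by head earlier in this report. The column names ft_sulfur.dioxide and rsa follow the output above; the input vector names are chosen here for illustration.

```r
# Values for the six rows printed by head().
free_so2  <- c(45, 14, 30, 47, 47, 30)
total_so2 <- c(170, 132, 97, 186, 186, 97)
sugar     <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9)
alcohol   <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1)

# New features: free / total sulfur dioxide and residual sugar / alcohol.
new_vars <- data.frame(
  ft_sulfur.dioxide = free_so2 / total_so2,
  rsa               = sugar / alcohol
)
round(new_vars, 3)
```

The first five rows agree with the values shown in the str output above (0.265, 0.106, 0.309, 0.253, 0.253 and 2.352, 0.168, 0.683, 0.859, 0.859).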

UNIVARIATE ANALYSIS

What is the structure of your dataset?

The original dataset comprises 4898 observations of 13 variables: a row index (X), 11 input variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol) and the response variable quality. All 11 input attributes are quantitative, containing numeric ranges, so none of them are factor variables.

The final variable, the response variable quality, is an ordered factor as it represents ratings. Its actual range in this dataset is 3 to 9, though wine scores can run from 0 to 10.

Most of the wines are considered good quality, as the median of the ratings is 6 and the mean is 5.878. Most pH values fall between 3 and 3.3. Almost 75% of wine samples have a quality rating of 5 or 6, with 6 being the most common. A rating of 6 was given 741 more times than the second most common rating of 5 (2198 versus 1457).

All of the features have a minimum value greater than 0 except for citric acid, and 19 wine samples have no citric acid present to add to their freshness. Residual sugar shows bimodal behaviour in the histogram, which suggests that there are two groups of wine in the dataset, i.e. one with more sugar and one with less.

There is one wine whose residual sugar value is 65.8 and according to the description of attributes, this wine is considered sweet. Fixed acidity and residual sugar have the highest medians of any of the variables measured in g/L.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is the response variable quality. I am performing the analysis to determine the effect of the significant variables on quality. I also sense that alcohol level will impact the quality of wine, so it becomes another variable of interest.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I feel all 11 input variables can have an impact in the quality model. I believe the variables whose extreme values lie furthest from their mean and median will have the largest impact. I would like to investigate the new variable rsa, the sugar to alcohol ratio, as it showed bimodal behaviour. Residual sugar will also be scrutinized in future analysis.

Did you create any new variables from existing variables in the dataset?

While exploring the dataset, I created a new sulfur dioxide variable as the ratio of free to total sulfur dioxide, and also the rsa variable to investigate whether the sugar to alcohol ratio might affect the quality of wine.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The quality variable was changed to an ordered factor, which lets later analysis treat the ratings as ordered levels. The observed quality ratings run from 3 to 9.

A log transformation was performed on residual sugar in the hope of obtaining a normal distribution. Instead, the result showed a bimodal distribution, which means the measurements fall into two groups. Any other unusual behaviour observed so far will be investigated further.

The rsa and residual sugar variables both deviate a lot from their means: their maximum values (5.62 and 65.8) are unusually far above their means (0.64 and 6.39).

B. BIVARIATE ANALYSIS

Bivariate Plots Section

I will include a PNG image of this chart in my project for better visualization and interpretation, as the chart is difficult to read with so many feature pairs.

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and new_quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747
## 
##  Pearson's product-moment correlation
## 
## data:  new_quality and density
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3322718 -0.2815385
## sample estimates:
##        cor 
## -0.3071233
## 
##  Pearson's product-moment correlation
## 
## data:  new_quality and chlorides
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2365501 -0.1830039
## sample estimates:
##        cor 
## -0.2099344

I observe weak to moderate correlations between these variables and quality. To understand the reason for these correlations, I examined the pairs via plots.
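The t statistics in the cor.test output above follow directly from the correlation coefficients and the sample size, via t = r * sqrt((n - 2) / (1 - r^2)). The sketch below reproduces them from the printed values; t_from_r is a hypothetical helper name.

```r
# Hypothetical helper: t statistic of a Pearson correlation r on n pairs.
t_from_r <- function(r, n) r * sqrt((n - 2) / (1 - r^2))

# Correlations with quality, copied from the cor.test() output above.
r_values <- c(alcohol = 0.4355747, density = -0.3071233, chlorides = -0.2099344)

# n = 4898 wines, so df = n - 2 = 4896 as printed in the output.
t_values <- t_from_r(r_values, n = 4898)
round(t_values, 3)
```

With nearly 5000 observations even a weak correlation such as -0.21 produces a very large t value, so the tiny p-values speak to the existence of a relationship, not to its strength.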

Box plots are used to explore the bivariate relationships of quality with the rest of the features.

I observe quite a few outliers beyond the boxes in the box plots, and to investigate further I am creating scatterplots for the top correlating variables.

To enhance the clarity of the plots generated above, I am changing the transparency to prevent overplotting, adding jitter for better visualization and limiting the axes to exclude the outliers observed.

The plot of quality against alcohol with jitter is clearer and suggests that the correlation between them is meaningful. I observe a small upward shift from left to right, which can also be seen in the box plot.

The quality level goes up to 9 when the alcohol content is between 11 and 13. This suggests that higher alcohol content is associated with better rated wine.

There is a negative correlation between quality and density in the plot, which confirms the previously noted negative correlation. As the quality level increases, the density values tend to decrease.

The downward shift observed from left to right confirms that high quality ratings have lower density values.

The plot between quality and chlorides shows a weak negative correlation between them. There is a dense cluster of chloride points between 0.02 and 0.06. Chloride values greater than 0.08 have quality ratings between approximately 4 and 6.

Further inference is carried out on the relationship between quality and alcohol, as I find it the most significant of all.

The histogram confirms the inference noted previously that as alcohol content increases, the quality of the wine increases too. This is visible in the histogram, with quality levels above 7 at higher alcohol contents. I want to explore the summary of alcohol after grouping by quality.

## # A tibble: 7 x 5
##   quality mean_alcohol median_alcohol min_alcohol max_alcohol
##     <ord>        <dbl>          <dbl>       <dbl>       <dbl>
## 1       3     10.34500          10.45         8.0        12.6
## 2       4     10.15245          10.10         8.4        13.5
## 3       5      9.80884           9.50         8.0        13.6
## 4       6     10.57537          10.50         8.5        14.0
## 5       7     11.36794          11.40         8.6        14.2
## 6       8     11.63600          12.00         8.5        14.0
## 7       9     12.18000          12.50        10.4        12.9

This summary lists the alcohol content for each quality rating and gives valuable insight.
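Two things can be read off this table with a few lines of R, using only the printed group means and the per-rating counts reported in the univariate section. The data frame name by_quality is chosen here for illustration.

```r
# Group means copied from the tibble above; n holds the number of wines
# per rating as reported in the univariate section.
by_quality <- data.frame(
  quality      = 3:9,
  n            = c(20, 163, 1457, 2198, 880, 175, 5),
  mean_alcohol = c(10.34500, 10.15245, 9.80884, 10.57537,
                   11.36794, 11.63600, 12.18000)
)

# Consistency check: the count-weighted average of the group means
# should match the overall mean alcohol from summary() (10.51).
overall_mean <- weighted.mean(by_quality$mean_alcohol, by_quality$n)
round(overall_mean, 2)

# Change in mean alcohol from one rating to the next.
step_change <- diff(by_quality$mean_alcohol)
round(step_change, 3)

# Rating with the lowest mean alcohol.
lowest <- by_quality$quality[which.min(by_quality$mean_alcohol)]
lowest
```

Mean alcohol falls from rating 3 to rating 5 and only rises from rating 5 upwards, so the relationship between alcohol and quality is not monotonic across the whole scale.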

The histogram reveals that higher quality ratings occur at lower density levels, below 0.992.

## # A tibble: 7 x 5
##   quality mean_density median_density min_density max_density
##     <ord>        <dbl>          <dbl>       <dbl>       <dbl>
## 1       3    0.9948840       0.994425     0.99110     1.00010
## 2       4    0.9942767       0.994100     0.98920     1.00040
## 3       5    0.9952626       0.995300     0.98722     1.00241
## 4       6    0.9939613       0.993660     0.98758     1.03898
## 5       7    0.9924524       0.991760     0.98711     1.00040
## 6       8    0.9922359       0.991640     0.98713     1.00060
## 7       9    0.9914600       0.990300     0.98965     0.99700

We can now say that the highest quality ratings have the lowest density.

I conclude that alcohol is the variable of greatest interest, and I am proceeding to build a linear model of quality on alcohol.

I will note down the predictions that I get from the model.

## 
## Call:
## lm(formula = I(new_quality) ~ I(alcohol), data = ww)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5317 -0.5286  0.0012  0.4996  3.1579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.582009   0.098008   5.938 3.08e-09 ***
## I(alcohol)  0.313469   0.009258  33.858  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared:  0.1897, Adjusted R-squared:  0.1896 
## F-statistic:  1146 on 1 and 4896 DF,  p-value: < 2.2e-16

The scatterplot does not show a clear linear relationship, as the points are far too scattered. The summary of the linear model shows that, despite the moderate correlation between these two variables, the R-squared value of 0.19 is low, meaning the model explains only about 19% of the variance in quality and has little predictive power. The residuals, however, are roughly symmetric around zero (median 0.0012).
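Two numbers in the model summary can be checked against earlier output. For a one-predictor linear model, R-squared equals the squared Pearson correlation, and the fitted line passes through the point of means. The sketch below uses only printed values; the variable names are illustrative. The reading of the roughly 2-point gap is an inference from these numbers, not something the analysis states.

```r
# Pearson correlation between alcohol and new_quality from cor.test().
r <- 0.4355747
r_squared <- r^2
round(r_squared, 4)   # compare with "Multiple R-squared" in the summary

# Coefficients from the lm() summary above.
intercept <- 0.582009
slope     <- 0.313469

# Fitted value at the mean alcohol content (10.51, from summary()).
fitted_at_mean <- intercept + slope * 10.51
round(fitted_at_mean, 3)

# Gap between the mean rating (5.878) and the fitted value at the mean.
mean_quality <- 5.878
offset <- mean_quality - fitted_at_mean
round(offset, 2)
```

The gap of about 2 would be expected if new_quality holds the ordered factor's level codes (1 to 7) rather than the ratings (3 to 9). If that is the case, the slope and R-squared are unaffected, but predictions from this model need 2 added to land on the rating scale.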

I feel a non-linear fit should be considered, by introducing either new variables or higher-order terms, to increase the predictive power of the model.

Here, in the bivariate analysis, I will now focus on the correlations among the other variables, leaving the quality variable aside as it is the main feature of interest.

The correlations observed are as follows.

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376
## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and residual.sugar
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4726723 -0.4280267
## sample estimates:
##        cor 
## -0.4506312
## 
##  Pearson's product-moment correlation
## 
## data:  residual.sugar and density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

Density, alcohol and residual sugar have strong correlations with one another, and I will further scrutinise their relationships visually.
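To compare the strength of these three relationships regardless of sign, the printed coefficients can be ranked by absolute value. The pair labels below are illustrative names.

```r
# Correlation coefficients copied from the cor.test() output above.
cors <- c(
  "alcohol ~ density"        = -0.7801376,
  "alcohol ~ residual.sugar" = -0.4506312,
  "residual.sugar ~ density" =  0.8389665
)

# Rank by absolute value: the sign gives direction, not strength.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
round(ranked, 3)

strongest <- names(ranked)[1]
strongest
```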

It is evident from the above graphs that density and residual sugar are positively correlated, which matches the known fact that wines with more residual sugar are denser. Density and alcohol, on the other hand, are strongly negatively correlated.

The residual sugar and alcohol graph gives a weaker indication: lower levels of alcohol go with higher sugar content, and vice versa.

BIVARIATE ANALYSIS

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

My main feature of interest was the quality variable, together with the variables most correlated with it: alcohol, density and chlorides.

My prime finding was the interesting relationship between quality and alcohol. Where the alcohol content was between 11 and 13, wine quality reached ratings of 8 and 9. The correlation does not hold uniformly across the range, though: it fluctuates, possibly because very high alcohol content makes the wine taste too strong and lowers its quality.

The linear model also confirmed this: the heavily scattered points indicate that the relationship is not well described by a straight line. Therefore, an apt amount of alcohol is a crucial factor in the quality of wine.

Density and alcohol were strongly negatively correlated (r = -0.78): as alcohol content increased, the density of the wine went down.

Thus a highly rated wine will tend to be less dense, in line with the negative correlation between quality and density (r = -0.31).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There was a strong positive relationship between density and residual sugar, i.e. wines with high sugar content have higher densities. Also, alcohol and density are negatively correlated, i.e. the higher the alcohol content, the lower the density of the wine.

There was a moderate relationship between alcohol and sugar, i.e. wines with lower levels of alcohol tend to have higher sugar content.

Therefore, these variables, in suitable quantities, will play a significant role in obtaining a wine of good quality.

What was the strongest relationship you found?

The relationship between residual sugar and density was found to be the strongest (r = 0.84), followed by the one between alcohol and density (r = -0.78).

C. MULTIVARIATE ANALYSIS

Multivariate Plots Section

We know that residual sugar showed a bimodal distribution. I know the cut function can be used to convert this variable and other variables of interest into categorical variables.

Let us convert the variables now.
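A minimal sketch of this conversion for alcohol is shown below, assuming the breaks are the minimum, quartiles and maximum from the summary output (the exact breaks used in the analysis are not shown). The names alcohol_head, alcohol_breaks and alcohol_bins are illustrative.

```r
# Alcohol values for the six rows printed by head().
alcohol_head <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1)

# Assumed breaks: min, quartiles and max of alcohol from summary().
alcohol_breaks <- c(8, 9.5, 10.4, 11.4, 14.2)

# cut() builds right-closed intervals by default, e.g. (8,9.5].
alcohol_bins <- cut(alcohol_head, breaks = alcohol_breaks)
levels(alcohol_bins)
as.integer(alcohol_bins)

# With default settings the lowest break itself falls outside every
# interval; include.lowest = TRUE closes the first interval on the left.
lowest_default  <- cut(8, breaks = alcohol_breaks)
lowest_included <- cut(8, breaks = alcohol_breaks, include.lowest = TRUE)
is.na(lowest_default)
as.integer(lowest_included)
```

On these six rows the bin codes come out as 1 1 2 2 2 2, matching the start of the alcohol.concat column in the output below. Note that a wine with exactly the minimum alcohol value would be left unbinned unless include.lowest = TRUE is set.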

## 'data.frame':    4898 obs. of  20 variables:
##  $ fixed.acidity        : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity     : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid          : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar       : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides            : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide  : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide : num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density              : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                   : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates            : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol              : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality              : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ quality.int          : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ ft_sulfur.dioxide    : num  0.265 0.106 0.309 0.253 0.253 ...
##  $ rsa                  : num  2.352 0.168 0.683 0.859 0.859 ...
##  $ residual.sugar.concat: Factor w/ 4 levels "(0.6,1.2]","(1.2,5.2]",..: 4 2 3 3 3 3 3 4 2 2 ...
##  $ density.concat       : Factor w/ 5 levels "(0.94,0.987]",..: 5 3 4 4 4 4 4 5 3 3 ...
##  $ chlorides.concat     : Factor w/ 4 levels "(0.008,0.036]",..: 3 3 3 4 4 3 3 3 3 3 ...
##  $ alcohol.concat       : Factor w/ 4 levels "(8,9.5]","(9.5,10.4]",..: 1 1 2 2 2 2 2 1 1 3 ...
##  $ quality.concat       : Factor w/ 3 levels "(2,5]","(5,7]",..: 1 1 1 1 1 1 1 1 1 1 ...

We see that the binned variables above follow the quartile ranges obtained in the summary. I would like to interpret quality according to its three bins: the lowest bin, (2,5], is treated as poor, the middle bin, (5,7], as intermediate, and anything above 7 as good quality wine. Note that the str output places wines rated 6 in the first bin, which suggests the bins were applied to the factor's level codes rather than to the ratings themselves.

Variables of interest as discussed before will be analyzed in a single plot.

The plot reveals that at lower levels of alcohol, density is higher, and vice versa. Also, higher levels of density go with high residual sugar content, visible as purple dots. Higher alcohol levels are associated with lower residual sugar content, visible as dense green points.

Next, I analyze the variables that had moderate correlation with quality in a single plot.

This single visualization is not easy to interpret, but it does show that higher chloride content is seen at lower levels of alcohol.

We also see a dense scattering of chloride points at quality levels between 5 and 7. In addition, higher chloride content occurs at higher alcohol levels for quality levels between 6 and 8. There is also a negative correlation between density and quality, which is evident from this plot as the small points representing low density are found more at the top of the chart than at the bottom.

Now, I would like to evaluate the density and chlorides features against quality for different levels of alcohol, using the median of each feature to clarify the inferences drawn so far.

The first chart shows that as alcohol content increases, the median density decreases across all quality levels, and I see the same consistency throughout my analysis.

The second chart shows that the median chlorides value increases for the lowest alcohol range as the quality rating increases from 6 to 8, while for the other alcohol ranges the median chlorides value decreases over the same increase in quality. To summarise, high chloride content is found in the lower alcohol ranges.

I would like to investigate the remaining features further before I begin to create a model for quality.

I have listed below the inferences drawn from the facet plots created above.

For good quality ratings, I observe the following:

Free and total sulfur dioxide values are denser in the poor and medium quality ranges. In the good quality range the points are sparser and less concentrated.

The sulphates values are densest within the range 0.3 to 0.7, and residual sugar content is below 20.

The volatile acidity values range between 0.2 and 0.5 for good quality wines, and the fixed acidity values range between 5 and 8.

The citric acid values range between 0.3 and 0.5 with denser points, and pH values range between 3 and 3.5.

I have noted some ranges of these features that are associated with good quality wines. These features may have little impact on quality, as they are only weakly correlated with it. Even so, they cannot be neglected: even minor features need close attention for a good quality wine.

## 
## Call:
## lm(formula = new_quality ~ alcohol + density + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + volatile.acidity + fixed.acidity + 
##     pH + citric.acid + sulphates + residual.sugar, data = ww)
## 
## Coefficients:
##          (Intercept)               alcohol               density  
##            1.482e+02             1.935e-01            -1.503e+02  
##            chlorides   free.sulfur.dioxide  total.sulfur.dioxide  
##           -2.473e-01             3.733e-03            -2.857e-04  
##     volatile.acidity         fixed.acidity                    pH  
##           -1.863e+00             6.552e-02             6.863e-01  
##          citric.acid             sulphates        residual.sugar  
##            2.209e-02             6.315e-01             8.148e-02
## 
## Call:
## lm(formula = new_quality ~ alcohol + density + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + volatile.acidity + fixed.acidity + 
##     pH + citric.acid + sulphates + residual.sugar, data = ww)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8348 -0.4934 -0.0379  0.4637  3.1143 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.482e+02  1.880e+01   7.881 3.98e-15 ***
## alcohol               1.935e-01  2.422e-02   7.988 1.70e-15 ***
## density              -1.503e+02  1.907e+01  -7.879 4.04e-15 ***
## chlorides            -2.473e-01  5.465e-01  -0.452  0.65097    
## free.sulfur.dioxide   3.733e-03  8.441e-04   4.422 9.99e-06 ***
## total.sulfur.dioxide -2.857e-04  3.781e-04  -0.756  0.44979    
## volatile.acidity     -1.863e+00  1.138e-01 -16.373  < 2e-16 ***
## fixed.acidity         6.552e-02  2.087e-02   3.139  0.00171 ** 
## pH                    6.863e-01  1.054e-01   6.513 8.10e-11 ***
## citric.acid           2.209e-02  9.577e-02   0.231  0.81759    
## sulphates             6.315e-01  1.004e-01   6.291 3.44e-10 ***
## residual.sugar        8.148e-02  7.527e-03  10.825  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2803 
## F-statistic: 174.3 on 11 and 4886 DF,  p-value: < 2.2e-16
##                              2.5 %        97.5 %
## (Intercept)           1.113373e+02  1.850484e+02
## alcohol               1.460027e-01  2.409487e-01
## density              -1.876695e+02 -1.128988e+02
## chlorides            -1.318480e+00  8.239266e-01
## free.sulfur.dioxide   2.078263e-03  5.387267e-03
## total.sulfur.dioxide -1.026733e-03  4.552383e-04
## volatile.acidity     -2.086208e+00 -1.640146e+00
## fixed.acidity         2.460834e-02  1.064316e-01
## pH                    4.798045e-01  8.928830e-01
## citric.acid          -1.656148e-01  2.097952e-01
## sulphates             4.347243e-01  8.282287e-01
## residual.sugar        6.672953e-02  9.623608e-02

Looking at the coefficient table, it is evident that the variables of interest (residual sugar, alcohol, density and volatile acidity) have the largest absolute t-values and correspondingly small p-values.

Eight of the 11 predictors (about 73%) pass the significance test at the 0.05 level; chlorides, total sulfur dioxide and citric acid do not. Apart from the variables of interest noticed in the plots, the remaining significant predictors should not be ignored when predicting the quality of white wine, since they also affect quality in some way.
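The count of significant predictors can be reproduced with a few lines of R. This is only a sketch: the t-values and p-values are transcribed by hand from the coefficient table above, the entries printed as `< 2e-16` are entered as that upper bound, and the 0.05 cut-off is my assumption.

```r
# Values transcribed from the coefficient table above (intercept omitted).
# "< 2e-16" is entered as its upper bound, 2e-16.
coef_tab <- data.frame(
  term = c("alcohol", "density", "chlorides", "free.sulfur.dioxide",
           "total.sulfur.dioxide", "volatile.acidity", "fixed.acidity",
           "pH", "citric.acid", "sulphates", "residual.sugar"),
  t_value = c(7.988, -7.879, -0.452, 4.422, -0.756, -16.373, 3.139,
              6.513, 0.231, 6.291, 10.825),
  p_value = c(1.70e-15, 4.04e-15, 0.65097, 9.99e-06, 0.44979, 2e-16,
              0.00171, 8.10e-11, 0.81759, 3.44e-10, 2e-16),
  stringsAsFactors = FALSE
)

alpha <- 0.05  # assumed significance level

# Predictors whose p-value falls below the cut-off
significant <- coef_tab$term[coef_tab$p_value < alpha]
n_sig <- length(significant)
share_sig <- n_sig / nrow(coef_tab)

# Predictors ordered by absolute t-value, largest first
ranked <- coef_tab$term[order(-abs(coef_tab$t_value))]

cat(n_sig, "of", nrow(coef_tab), "predictors significant:",
    round(100 * share_sig), "%\n")
print(head(ranked, 4))
```

With a fitted model object, the same columns can be read from `summary(fit)$coefficients` instead of being typed in by hand.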

None of the combinations yielded a good predictive model of white wine quality. The R-squared value is low (0.2819), so this linear fit explains only about 28% of the variance in quality and has poor predictive capability.

I conclude that the model might perform better with non-linear terms or additional features.

MULTIVARIATE ANALYSIS

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The multivariate analysis confirmed and clarified the insights I gained in the bivariate analysis section.

Exploring density, alcohol and residual sugar together in a single plot allowed me to draw a clear conclusion about the relationship between these three variables.

The median analysis strengthened my inference that the relationship between density and alcohol is consistent at all quality levels, whereas the relationship between chlorides and alcohol may change across quality levels.
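A per-quality median summary of this kind can be sketched in base R as follows. The data frame `toy` is synthetic and only borrows the column names of the wine data, so the numbers it prints say nothing about the real dataset.

```r
set.seed(42)

# Synthetic stand-in for the wine data: column names mirror the dataset,
# values are made up for illustration only.
toy <- data.frame(
  quality = sample(3:9, 500, replace = TRUE),
  alcohol = runif(500, min = 8, max = 14),
  density = runif(500, min = 0.987, max = 1.002)
)

# Median alcohol and density within each quality level
med_by_quality <- aggregate(cbind(alcohol, density) ~ quality,
                            data = toy, FUN = median)
print(med_by_quality)
```

Plotting these medians against quality is one way to check whether a relationship holds at every quality level or only at some of them.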

I did not stop there: I also analysed quality against the remaining variables, which had weaker correlations, to look for further insights.

Were there any interesting or surprising interactions between features?

The relationship between alcohol and quality interested me throughout the analysis, as alcohol is positively correlated with the quality ratings.
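Because quality is an ordered rating rather than a continuous measurement, a rank correlation is one reasonable way to quantify this relationship. The sketch below uses made-up data (the vectors `alcohol` and `quality` are simulated, not taken from the wine dataset) and only shows the shape of the calculation.

```r
set.seed(7)
n <- 300

# Simulated alcohol values and hypothetical ratings that rise with alcohol,
# rounded and clamped to the integer range 3..9
alcohol <- runif(n, min = 8, max = 14)
quality <- pmin(9, pmax(3, round(0.5 * alcohol + rnorm(n))))

# Spearman rank correlation; exact = FALSE because the ratings contain ties
rho <- cor(alcohol, quality, method = "spearman")
rho_test <- cor.test(alcohol, quality, method = "spearman", exact = FALSE)

print(rho)
print(rho_test$p.value)
```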

I also suspect that there is a threshold for alcohol beyond which a further increase in its content will deteriorate the quality of the wine, because the wine turns out too strong.

Thus, the right amount of alcohol is a crucial factor in producing good quality wine, and I consider it one of the most important features in my analysis.

Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I built a linear regression model to predict the quality of white wine from all the other features in the data. The model confirmed the significant variables I found in my previous analysis, as these variables of interest had high t-values.

Eight of the 11 predictors (about 73%) were significant, which also revealed that variables other than the ones I highlighted contribute to predicting the quality of white wine and should not be neglected.

However, the model produced a low R-squared value, which means that it has poor predictive capability. There are two possible reasons for this: the data may lack additional variables that are crucial to improving the prediction, or a more flexible non-linear model may be needed.
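To illustrate what fitting a more flexible model could look like, here is a minimal sketch that compares a straight-line fit with a quadratic fit. The data is simulated with a curved relationship built in, so the improvement shown here is by construction and does not demonstrate that the wine model would improve in the same way.

```r
set.seed(123)
n <- 300

# Simulated predictor and response with a deliberately curved relationship
x <- runif(n, min = 8, max = 14)
y <- 5 + 0.8 * (x - 11) - 0.3 * (x - 11)^2 + rnorm(n, sd = 0.5)
toy <- data.frame(x = x, y = y)

fit_lin  <- lm(y ~ x, data = toy)           # straight-line fit
fit_quad <- lm(y ~ poly(x, 2), data = toy)  # adds a quadratic term

# Adjusted R-squared of both fits, side by side
adj_r2 <- c(linear    = summary(fit_lin)$adj.r.squared,
            quadratic = summary(fit_quad)$adj.r.squared)
print(adj_r2)

# F-test for whether the quadratic term improves on the linear fit
cmp <- anova(fit_lin, fit_quad)
print(cmp)
```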

FINAL PLOTS AND SUMMARY

Plot One

Description Plot One

I found this plot interesting because I consider alcohol a significant variable in determining the quality of white wine. As discussed previously, these two variables are positively correlated. It is quite evident from both plots that as alcohol content increases there is initially a decrease in quality ratings, followed by a sudden shift in quality ratings as alcohol content starts to increase from 6 up to 9. This signifies the permissible alcohol content for making wine.

Another finding is that if alcohol content increases beyond 9% by volume, the quality of the wine may deteriorate rapidly, as wine with a high alcohol content will taste very strong.

This supports my view that alcohol is a key variable for the quality of the wine produced.

Plot Two

Description Plot Two

Among the variables of interest, density was my second priority, and hence I chose this plot. Density is correlated with quality, alcohol and residual sugar, and plays a crucial role in building suitable inferences in my analysis.

This histogram reveals that higher quality wines have lower density levels, mostly in and around 0.992 g/cm^3. The left portion of the histogram, with lower density levels, shows the colors of the higher quality ratings.

Plot Three

Description Plot Three

This visualization shows the relationship between the variables of interest in a single plot. I converted the residual sugar variable into a categorical variable so that I could extract more insights at different levels of sugar.
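Converting a numeric variable into categories can be done with `cut`. The sketch below bins a simulated, right-skewed variable at its quartiles; the name `residual_sugar`, the choice of four bins and the quartile break points are assumptions for illustration and are not necessarily the breaks used for the plot.

```r
set.seed(1)

# Simulated, right-skewed stand-in for residual sugar (not the wine data)
residual_sugar <- rlnorm(400, meanlog = 1.5, sdlog = 0.8)

# Break points at the quartiles (assumed choice of bins)
breaks <- quantile(residual_sugar, probs = seq(0, 1, by = 0.25))

# include.lowest = TRUE keeps the minimum value inside the first bucket
sugar_bucket <- cut(residual_sugar, breaks = breaks, include.lowest = TRUE)

print(table(sugar_bucket))
```

On real data with many repeated values, two quantiles can coincide and `cut` will then stop with an error because the breaks are not unique; passing the breaks through `unique` first is one way around that, at the cost of fewer buckets.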

As evident from the graph, residual sugar levels shift downward as we proceed from the highest to the lowest density levels. The highest residual sugar content (purple dots, ranging from 9.9 to 65.8) is seen only at higher density levels, while the lowest sugar content (red dots, ranging from 0.6 to 1.2) is seen at lower density levels.

As the alcohol level rises, density decreases, which is evident from the downward trend to the right in the scatterplot. Thus higher alcohol content is associated with lower density, and higher sugar content with higher density.

The purple points, which have the highest sugar content, are concentrated in the left portion of the graph, which reveals that wines with lower alcohol content have higher sugar levels. I can now conclude that higher alcohol content will make wine strong and bitter. This plot supports the insights from my whole white wine analysis.

D. REFLECTION

The White Wine EDA project has made me skilled in using R for exploratory data analysis. I now know the libraries involved, such as ggplot2, dplyr and GGally. I started by using basic functions like dim, str, names and head to understand the structure and nature of the white wine dataset I was assigned to explore. I then learnt to preprocess the data by omitting NA observations and subsetting the data to eliminate unwanted variables. I also applied more advanced functions like summary, group_by, ggpairs and quantile ranges. I can now confidently create plots such as scatterplots, boxplots and histograms. I can also improve the clarity of the plots through transparency, axis limits, jitter and facet wrapping. I learnt a great deal about modelling in R. Finally, I now know how to build a project in R through R Markdown (Rmd).

I found the multivariate section quite challenging, as it was hard to determine the structure of the plots. I had to decide which variables to include and how to build inferences from each plot. I devoted a lot of time to gaining valuable insights and to relating the variables to one another. I made this difficult process easier by examining the variables of interest obtained in the bivariate section and focusing on these variables in the multivariate analysis to draw significant conclusions.

I used visualizations to identify significant relationships and confirmed them through the correlations observed between the variables. It was a fun learning experience to analyze the same set of variables under different circumstances and to draw new information at each stage.

Success came in due course while evaluating particular relationships with multiple R tools. It seemed repetitive at the time, because many of the insights were the same across different chart types and analysis tools. Still, this added value to my skill set for two reasons: new insights may appear when the data is displayed differently, and it gave me the opportunity to practice all of the functions that I had learned during the course.

During the data preprocessing stage, I had only initial thoughts about which significant variables I would come across later in the analysis. At first I found it astonishing that alcohol was a more informative and significant variable than pH. Later, after multiple analyses, I gained more knowledge and insight into why the identified significant variables play a vital role in the quality of white wine.

Finally, to perform a more complete analysis, I would like to dig deeper into the white wine making process and see whether I can include additional variables that may have been left out and that are relevant to my analysis. This should help overcome the limitations of the current model and produce a good predictive model for this dataset. I also want to spend more time on model re-specification by transforming the variables. After settling on the transformations through regression, I want to validate the model before finalising it.
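One simple form this validation could take is a hold-out split: fit the model on part of the data and measure the error on the rest. The sketch below uses simulated data, and the 80/20 split, the predictors `x1` and `x2` and the use of RMSE are all assumptions for illustration.

```r
set.seed(2024)
n <- 500

# Simulated data with two hypothetical predictors (not the wine data)
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 6 + 0.5 * toy$x1 - 0.3 * toy$x2 + rnorm(n, sd = 0.7)

# 80/20 train/test split (assumed proportion)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows only, then predict the held-out rows
fit  <- lm(y ~ x1 + x2, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on data the model has not seen
rmse <- sqrt(mean((test$y - pred)^2))
print(rmse)
```

Comparing this held-out error across candidate transformations gives a fairer picture than R-squared on the training data alone.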

I have definitely improved my thought process and reasoning abilities in drawing conclusions from my results. I can now work on any dataset and perform exploratory data analysis. I will keep working on different datasets in future to keep improving as a data analyst.